A food delivery company is committed to providing a delivery experience that delights their customers while still being incredibly efficient. Thus it is critical that they have the best possible model of how long it takes for a food order to be prepared. This allows them to ensure that a rider arrives at a restaurant to pick up an order exactly when the food is ready. The aim of this exercise is to use historical data, clean it, manipulate it to extract insights, then finally predict the food preparation time for each order.
The aim of this notebook is to document the process we followed, describing each step and elaborating on the visuals alongside the code that generates them. It is not what we would present to a stakeholder or to a non-technical audience; a separate presentation in the GitHub repo is dedicated to non-technical audiences and showcases the storytelling.
# install.packages('tidyverse')
# install.packages('skimr')
# install.packages('lubridate')
# install.packages('cowplot')
# install.packages('plotly')
# install.packages('GGally')
library(tidyverse) # General purpose
library(skimr) # EDA
library(lubridate) # Dates manipulation
library(cowplot) # Plot multiple plots
library(plotly) # Interactive plots
library(GGally) # Pairplot
# Load data
orders <- read.csv('orders.csv')
restaurants <- read.csv('restaurants.csv')
# Structure of the datasets
str(orders)
## 'data.frame': 32394 obs. of 6 variables:
## $ order_acknowledged_at: chr "2015-06-01 12:28:28.952789+01:00" "2015-06-06 17:06:24.434807+01:00" "2015-06-08 14:56:15.503204+01:00" "2015-06-12 15:12:20.497925+01:00" ...
## $ order_ready_at : chr "2015-06-01 14:12:09.474896+01:00" "2015-06-06 17:16:27.520253+01:00" "2015-06-08 15:03:39.397496+01:00" "2015-06-12 15:23:30.064683+01:00" ...
## $ order_value_gbp : num 59.9 24 15.2 28.1 56.3 ...
## $ restaurant_id : int 1326 1326 1326 1326 255 255 255 255 255 255 ...
## $ number_of_items : int 2 8 3 8 7 3 4 4 3 13 ...
## $ prep_time_seconds : int 6220 603 443 669 3314 1049 1468 1300 1254 5209 ...
str(restaurants)
## 'data.frame': 1697 obs. of 4 variables:
## $ restaurant_id: int 3 5 7 8 9 10 12 13 14 15 ...
## $ country : chr "UK" "UK" "UK" "UK" ...
## $ city : chr "London" "London" "London" "London" ...
## $ type_of_food : chr "thai" "italian" "italian" "chinese" ...
It seems that we have 32,394 observations and 6 variables in the orders dataset, whereas the restaurants dataset has 1,697 rows and 4 columns. After checking that restaurant_id is a unique identifier shared by the two datasets, we can join them together. This way each order carries the country, the city and the type of food of its restaurant.
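Before joining, it is worth confirming that `restaurant_id` really is a unique key in the restaurants table. A minimal sketch on a toy stand-in (the real check is the same call on the loaded `restaurants` data frame):

```r
# Toy stand-in for the restaurants table
restaurants_demo <- data.frame(
  restaurant_id = c(3, 5, 7, 8),
  type_of_food  = c("thai", "italian", "italian", "chinese")
)
# anyDuplicated() returns 0 when every id occurs exactly once,
# i.e. restaurant_id is a valid unique join key
anyDuplicated(restaurants_demo$restaurant_id) == 0  # TRUE
# On the real data: anyDuplicated(restaurants$restaurant_id) == 0
```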
# Inner join the data to retrieve all the info we need from both the given tables - remove duplicates
orders <- orders %>% inner_join(restaurants, 'restaurant_id') %>% unique()
# Create features to facilitate EDA
orders <- orders %>%
mutate(prep_time_minutes = prep_time_seconds/60, # Preparation time in minutes
prep_time_hours = prep_time_seconds/3600, # Preparation time in hours
order_placed_at = strptime(order_acknowledged_at, # Timestamp conversion; note the "+01:00" offset in the raw strings is ignored by this format
"%Y-%m-%d %H:%M:%S",
tz = "GMT"),
order_ready_at = strptime(order_ready_at, # Timestamp conversion in GMT (UK time)
"%Y-%m-%d %H:%M:%S",
tz = "GMT"),
restaurant_id = as.factor(restaurant_id), # Change data type from character to factor
country = as.factor(country), # Change data type from character to factor
city = as.factor(city), # Change data type from character to factor
type_of_food = as.factor(type_of_food), # Change data type from character to factor
date_of_order = date(order_placed_at), # Date that the order was acknowledged
date_order_prepared = date(order_ready_at), # Date that the order was ready
day_name = lubridate::wday(date_of_order, label = TRUE), # Name of the day the order has been placed
hour_of_the_day = as.factor(hour(order_placed_at)) # Hour of the day where the order has been placed
) %>%
select(-order_acknowledged_at, -order_ready_at) # Remove those two columns as they are not needed
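One caveat on the timestamp parsing: the raw strings carry a "+01:00" UTC offset that the `strptime` format above silently drops. A minimal base-R sketch of an offset-aware parse (the sample value is from the data above; `%z` needs the colon stripped from the offset):

```r
raw <- "2015-06-01 12:28:28.952789+01:00"   # sample order_acknowledged_at value
# %z expects "+0100" (no colon), so drop the colon from the offset first
clean <- sub("([+-]\\d{2}):(\\d{2})$", "\\1\\2", raw)
# %OS parses the fractional seconds, %z applies the UTC offset
ts <- as.POSIXct(clean, format = "%Y-%m-%d %H:%M:%OS%z", tz = "UTC")
format(ts, "%H:%M", tz = "UTC")  # 12:28 at UTC+1 is "11:28" in UTC
```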
# Use skim function to gain more info regarding missing values and high level statistics
skim(orders)
| Name | orders |
| Number of rows | 32394 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 2 |
| factor | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| order_placed_at | 0 | 1 | 97376 | 227208 | 0 | 31743 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_of_order | 0 | 1 | 2015-06-01 | 2015-07-01 | 2015-06-16 | 31 |
| date_order_prepared | 0 | 1 | 2015-06-01 | 2015-07-01 | 2015-06-16 | 31 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| restaurant_id | 0 | 1 | FALSE | 1697 | 408: 679, 365: 339, 193: 312, 20: 281 |
| country | 0 | 1 | FALSE | 4 | UK: 29689, Fra: 2032, Ire: 353, Ger: 320 |
| city | 0 | 1 | FALSE | 22 | Lon: 25481, Par: 2032, Man: 964, Bri: 652 |
| type_of_food | 0 | 1 | FALSE | 83 | ita: 4759, bur: 3916, tha: 2891, ame: 2395 |
| day_name | 0 | 1 | TRUE | 7 | Sun: 5501, Sat: 4929, Fri: 4888, Tue: 4689 |
| hour_of_the_day | 0 | 1 | FALSE | 18 | 20: 6691, 21: 6001, 19: 4129, 22: 3640 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order_value_gbp | 0 | 1 | 26.55 | 19.26 | 0.5 | 16.70 | 21.40 | 30.25 | 711.00 | ▇▁▁▁▁ |
| number_of_items | 0 | 1 | 4.03 | 2.39 | 1.0 | 3.00 | 3.00 | 5.00 | 38.00 | ▇▁▁▁▁ |
| prep_time_seconds | 0 | 1 | 1299.99 | 4404.57 | 0.0 | 518.00 | 820.00 | 1227.00 | 255499.00 | ▇▁▁▁▁ |
| prep_time_minutes | 0 | 1 | 21.67 | 73.41 | 0.0 | 8.63 | 13.67 | 20.45 | 4258.32 | ▇▁▁▁▁ |
| prep_time_hours | 0 | 1 | 0.36 | 1.22 | 0.0 | 0.14 | 0.23 | 0.34 | 70.97 | ▇▁▁▁▁ |
# Check if the number of rows of the above are the same
# If they are equal, each restaurant has exactly one type of food
dim(orders %>% count(restaurant_id, type_of_food))[1] == dim(restaurants)[1] # Same number of rows
## [1] TRUE
#orders %>% filter(prep_time_hours > 0.34 + 1.5 * IQR(orders$prep_time_hours)) %>% arrange(desc(prep_time_hours))
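The commented-out filter above applies the standard 1.5 × IQR rule (the 0.34 it uses is simply Q3 of `prep_time_hours` from the skim table). A minimal base-R sketch of the threshold it computes, on toy values:

```r
prep <- c(0.1, 0.2, 0.25, 0.3, 0.34, 0.4, 5.0)  # toy prep times in hours
q3  <- unname(quantile(prep, 0.75))
thr <- q3 + 1.5 * IQR(prep)   # Q3 + 1.5 * (Q3 - Q1)
prep[prep > thr]              # values flagged as outliers: 5.0
```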
There are obviously cases where the order was placed late at night and delivered very early the next day (for example, ordered at 11pm and delivered at 1am the next morning), which sounds like a normal scenario. However, there also seem to be cases where the order was placed a day or more in advance. To distinguish between the two, let's create a flag indicating when an order is set for next-day delivery, excluding the cases where the order was simply placed very late at night and delivered within 3 hours (a reasonable, slightly generous value chosen to cover a wide range of orders in this business context).
# Remove the 103 observations that have 0 prep time
# Create a flag to differentiate between the above-mentioned orders
orders <- orders %>%
filter(prep_time_seconds != 0) %>%
mutate(flag_diff_day = ifelse(date_of_order != date_order_prepared & prep_time_hours > 3,
'Next_day_order',
'Same_day_order'))
# Statistics about those categories
orders %>%
group_by(flag_diff_day) %>%
summarise(counts = n(),
average_prep_time = mean(prep_time_hours),
min_prep_time = min(prep_time_hours),
max_prep_time = max(prep_time_hours))
## # A tibble: 2 × 5
## flag_diff_day counts average_prep_time min_prep_time max_prep_time
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Next_day_order 74 23.4 14.5 71.0
## 2 Same_day_order 32217 0.309 0.00111 10.5
Same-day orders have a maximum prep time of a bit more than 10 hours, which is plausible if someone orders in the morning or at noon for a dinner delivered at night. The average prep time of same-day orders is about half an hour. For modeling purposes and for better visualization, the 74 'abnormal' observations shown in the table above will be deleted. (Normally we would perform a proper outlier analysis to understand what is going on with them and whether they should be kept; potentially a different model could be built for those cases.) In the graph below, the 'next day' orders that will be removed for the purposes of this analysis are shown in red, and the orders that the analysis is based on are shown in blue. The x-axis shows the cost of the order and the y-axis the preparation time in hours, illustrating the relationship between the two, which we will examine in more detail at a later stage.
# Plot order price vs preparation time to visualize extreme values
# Interactive plot
ggplotly(orders %>%
ggplot(aes(x = order_value_gbp, y = prep_time_hours, color = flag_diff_day)) +
geom_point(alpha = 0.7) +
labs(title = 'Order price vs preparation time to understand outliers',
x = 'Order cost', y = 'Preparation time in hours'))
# Remove next day deliveries
orders <- orders %>% filter(flag_diff_day == 'Same_day_order')
# Distribution of number of items
plt1 <- ggplot(orders, aes(x = number_of_items)) + geom_histogram(bins = 50, fill = 'steelblue') +
labs(x = '', y = 'Frequency', title = 'Distribution of order items')
plt2 <- ggplot(orders, aes(x = number_of_items)) + geom_boxplot(fill = 'steelblue') +
labs(x = 'Number of items') +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
cowplot::plot_grid(plt1, plt2,
ncol = 1, rel_heights = c(2, 1),
align = 'v', axis = 'lr')
# Distribution of order costs
plt3 <- ggplot(orders, aes(x = order_value_gbp)) + geom_histogram(bins = 500, fill = 'steelblue') +
labs(x = '', y = 'Frequency', title = 'Distribution of order costs')
plt4 <- ggplot(orders, aes(x = order_value_gbp)) + geom_boxplot(fill = 'steelblue') +
labs(x = 'Order cost in GBP') +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
cowplot::plot_grid(plt3, plt4,
ncol = 1, rel_heights = c(2, 1),
align = 'v', axis = 'lr')
# Distribution of preparation time
plt5 <- ggplot(orders, aes(x = prep_time_hours)) + geom_histogram(bins = 500, fill = 'steelblue') +
labs(x = '', y = 'Frequency', title = 'Distribution of preparation time (in hours)')
plt6 <- ggplot(orders, aes(x = prep_time_hours)) + geom_boxplot(fill = 'steelblue') +
labs(x = 'Preparation time in hours') +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
cowplot::plot_grid(plt5, plt6,
ncol = 1, rel_heights = c(2, 1),
align = 'v', axis = 'lr')
All three plots show right-skewed distributions. Order prices and the number of items per order will be predictors in our models, and the fact that they follow this kind of distribution will make our life a bit more difficult. However, by normalizing them we can overcome potential issues. The outlier removal helped our visuals: had we kept the 74 next-day-delivery observations, the distributions would look even more heavily skewed. Because of the skewness, the main metric we will use throughout this analysis is the median (the point above and below which half (50%) of the observed data falls).
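One sketch of the normalization idea: a `log1p` transform pulls in the long right tail, so mean and median move closer together (simulated data, since the point does not depend on the actual orders):

```r
set.seed(42)
skewed      <- rexp(10000, rate = 1/20)  # right-skewed, like prep times
transformed <- log1p(skewed)             # log(1 + x), finite at zero
# Skew shows up as mean >> median; the transform shrinks that gap
mean(skewed) - median(skewed)            # large gap
mean(transformed) - median(transformed)  # much smaller gap
```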
The vast majority of orders come from London, with more than 25k orders. Paris comes second with around 2 thousand orders and appears to be the only French city represented. The other two countries are Germany (Berlin and Munich) and Ireland (Dublin).
# Plot the amount of orders by city color-coded by country
ggplotly(orders %>%
group_by(country, city) %>%
summarise(counts = n()) %>%
mutate(total = sum(counts), perc = round(counts/total, 2)) %>%
arrange(desc(counts)) %>%
ggplot(aes(x = fct_reorder(city, counts), y = counts, fill = country)) +
geom_col() +
coord_flip() +
labs(x = 'City', y = 'Number of orders', title = 'Number of orders by city'))
ggplotly(orders %>%
group_by(restaurant_id, type_of_food) %>%
summarise(number_of_orders = n()) %>%
arrange(desc(number_of_orders)) %>%
head(30) %>%
ggplot(aes(x = fct_reorder(restaurant_id, number_of_orders), y = number_of_orders)) +
geom_col() +
coord_flip() +
labs(x = 'Restaurant id',
y = 'Number of orders',
title = 'Top 30 restaurants based on order frequency'))
ggplotly(orders %>%
group_by(restaurant_id, type_of_food) %>%
summarise(median_prep_time = median(prep_time_hours), n = n()) %>%
arrange(desc(median_prep_time)) %>%
head(30) %>%
ggplot(aes(x = fct_reorder(restaurant_id, median_prep_time), y = median_prep_time)) +
geom_col(fill = 'darkred') +
coord_flip() +
labs(x = 'Restaurant id',
y = 'Median preparation time in hours',
title = 'Top 30 slowest restaurants based on median preparation time in hours'))
ggplotly(orders %>%
group_by(type_of_food) %>%
summarise(counts = n()) %>%
arrange(desc(counts)) %>%
head(30) %>%
ggplot(aes(x = fct_reorder(type_of_food, counts), y = counts)) +
geom_col(fill = 'darkgreen') +
coord_flip() +
labs(x = 'Type of food',
y = 'Number of orders',
title = 'Top 30 food types based on number of orders'))
Italian food seems to be the most prominent type of food that people order, followed by burgers and Thai. Some food types may well overlap, but assuming the restaurant labeling is correct we can proceed.
cowplot::plot_grid(
orders %>%
group_by(type_of_food) %>%
summarise(median_prep_time = median(prep_time_hours)) %>%
arrange(median_prep_time) %>%
head(10) %>%
ggplot(aes(x = fct_reorder(type_of_food, -median_prep_time), y = median_prep_time)) +
geom_col(fill = 'steelblue') +
coord_flip() +
labs(x = 'Food type',
y = 'Median preparation time',
title = 'Top 10 fastest median prep time by food type'),
orders %>%
group_by(type_of_food) %>%
summarise(median_prep_time = median(prep_time_hours)) %>%
arrange(desc(median_prep_time)) %>%
head(10) %>%
ggplot(aes(x = fct_reorder(type_of_food, median_prep_time), y = median_prep_time)) +
geom_col(fill = 'darkred') +
coord_flip() +
labs(x = 'Food type',
y = 'Median preparation time',
title = 'Top 10 slowest median prep time by food type'),
ncol = 1, align = 'v', axis = 'lr')
ggplotly(orders %>%
group_by(hour_of_the_day) %>%
summarise(number_of_orders = n()) %>%
arrange(desc(number_of_orders)) %>%
ggplot(aes(x = hour_of_the_day, y = number_of_orders)) +
geom_col(fill = 'steelblue') +
labs(x = 'Hour of the day',
y = 'Number of orders',
title = 'Busiest time of the day'))
As expected, there is a rise in orders around 1pm when people eat their lunch. Orders then trend downward until 5pm and upward again until 8pm, when the peak is reached with more than 6k orders overall.
ggplotly(orders %>%
group_by(day_name) %>%
summarise(number_of_orders = n()) %>%
arrange(desc(number_of_orders)) %>%
ggplot(aes(x = day_name, y = number_of_orders)) +
geom_col(fill = 'steelblue') +
labs(x = 'Day of week',
y = 'Number of orders',
title = 'Busiest day of the week'))
Based on the graph above, it seems that Sunday is the busiest day and Thursday the quietest in terms of order frequency.
ggplotly(orders %>%
group_by(day_name) %>%
summarise(median_prep_time = median(prep_time_hours),
average_prep_time = mean(prep_time_hours)) %>%
arrange(desc(median_prep_time)) %>%
ggplot(aes(x = day_name, y = median_prep_time)) +
geom_point(aes(color = 'median_prep_time'), size = 4) +
geom_point(aes(y = average_prep_time, color = 'average_prep_time'), size = 4) +
labs(x = 'Day of week',
y = 'Prep time in hours',
title = 'Median vs average prep time by day of the week',
color = 'Metric') +
theme(legend.position = "right"))
The graph illustrates the median and the mean preparation time for each day of the week. Tuesday has the fastest average preparation time, while the slowest average and median preparation times occur on Fridays. Again we see that the median is consistently below the average, showing that on every day there are some orders that take long to prepare.
In the graph below you can see the daily average preparation time (red) and the daily median preparation time (blue). The averages are consistently higher than the medians, since extreme values pull the average up; this is why the median behaves as a more robust measure that is not affected by outliers. The median preparation time holds steadily around 0.21-0.23 hours, whereas the average fluctuates heavily over time, from 0.26 hours up to a maximum of 1.56 hours on 1st July. Checking that particular day, we find only 4 observations, one of which has 5.7 hours of preparation, which explains the spike. In the modeling notebook we will remove this date for consistency, as it is where the month changes.
You can also hover your mouse over the line graphs to check the daily values.
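The robustness claim is easy to verify in miniature: a single extreme order barely moves the median but drags the mean up (toy values chosen to echo the 0.21-0.23 hour range above):

```r
prep_hours   <- c(0.20, 0.21, 0.21, 0.22, 0.23)  # typical same-day prep times
with_outlier <- c(prep_hours, 5.7)               # add one 5.7-hour order
median(prep_hours);  median(with_outlier)        # 0.21 vs 0.215: barely moves
mean(prep_hours);    mean(with_outlier)          # 0.214 vs ~1.128: jumps
```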
ggplotly(orders %>%
group_by(date_of_order) %>%
summarise(average_time = mean(prep_time_hours), median_time = median(prep_time_hours)) %>%
ggplot(aes(x = date_of_order, y = median_time)) +
geom_line(color = 'steelblue') +
geom_line(aes(y = average_time), color = 'tomato') +
labs(x = 'Date',
y = 'Preparation time in hours',
title = 'Daily average vs daily median preparation time in hours'))
# Show the 4 cases of the last date
orders %>% filter(date_of_order == '2015-07-01') %>% select(restaurant_id, city, type_of_food, prep_time_hours)
## restaurant_id city type_of_food prep_time_hours
## 1 9 Paris korean 0.2150000
## 2 9 Paris korean 0.2036111
## 3 811 London italian 5.6838889
## 4 2728 Edinburgh chicken 0.1341667
# Remove 1st July data
orders <- orders %>% filter(date_of_order != '2015-07-01')
# Keep the numeric variables to check linear correlation and remove time in secs and minutes
pearson_cor_variables <- orders %>% keep(is.numeric) %>% select(-prep_time_seconds, -prep_time_minutes)
# Create correlation table
correlation_table <- cor(pearson_cor_variables)
# Plot the relationships, distributions and correlation coefficients by pairs
ggpairs(pearson_cor_variables, ggplot2::aes(alpha = 0.5))
As expected, the more items in an order, the higher the price: a moderate positive linear relationship with a coefficient of about 0.55. We might also expect a higher correlation between preparation time and number of items, but the coefficient of 0.11 indicates approximately no linear relationship between the two. A very weak positive linear relationship appears between order price and prep time (0.18). The absence of a strong linear relationship is not necessarily a bad sign: (1) we avoid the issue of multicollinearity between the predictors, and (2) there may be another type of relationship that can still yield good predictions for our model.
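A follow-up on point 2: Spearman's rank correlation detects any monotonic relationship, not just linear ones, so it is a cheap next check (illustrated with y = x³, which is nonlinear but perfectly monotonic):

```r
x <- seq(1, 10, by = 0.1)
y <- x^3
cor(x, y, method = "pearson")   # < 1: the relationship is not linear
cor(x, y, method = "spearman")  # exactly 1: the ranks agree perfectly
# On the orders data the analogous check would be, e.g.:
# cor(orders$number_of_items, orders$prep_time_seconds, method = "spearman")
```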